.\" $Id$

.TH IO-WATCHDOG 1 "@META_DATE@" "@META_ALIAS@" "IO Watchdog"

.SH NAME
io-watchdog \- IO Watchdog driver utility

.SH SYNOPSIS
.B io-watchdog
[\fIOPTION\fR]... executable args...

.SH DESCRIPTION
The IO watchdog is a facility for monitoring user applications and 
parallel jobs for hangs which typically have a side-effect of ceasing
all IO in an application (i.e. one that writes something to a log or
data file during each cyle of computation.).  The watchdog attempts to 
monitor all IO generated from an application and triggers a set of 
user-defined actions when IO has stopped for a configurable timeout period.
.PP
The \fBio-watchdog\fR utility is used to start applications under the
IO watchdog service and configure watchdog parameters. This utility
is typically needed when a job is run outside SLURM. When jobs
are run under SLURM, the \-\-io-watchdog option to \fBsrun\fR(1)
can be used when the io-watchdog plugin is enabled. The \fBsrun\fR
option \-\-io-watchdog=help can be used to display a short help
message for running the IO watchdog under SLURM.

.SH OPTIONS
.TP
.BI "-h, --help"
Display a summary of the command-line options.
.TP
.BI "-v, --verbose"
Increase the verbosity of the io watchdog.
.TP
.BI "-q, --quiet"
Decrease the verbosity of the io watchdog.
.TP
.BI "-l, --list-actions"
List pre-defined system and user actions then exit. System actions
are those scripts found in the default search path, or any search
paths added to the system io-watchdog.conf file.
.TP
.BI "-f, --config=" FILE
Load configuration from file FILE. The config file specified 
will override the default user config file ~/.io-watchdogrc. This
option may be a convenient way to replace the use of other
watchdog configuration options below.
.TP
.BI "-t, --timeout=" NUMBER[SUFFIX]
Set watchdog timeout to NUMBER seconds. If supplied SUFFIX may be 's' 
for seconds, 'm' for minutes, 'h' for hours, or 'd' for days.
NUMBER may be an arbitrary floating-point number.
.TP
.BI "-a, --action=" SCRIPT,...
Run action SCRIPT on watchdog trigger. More than one action
may be specified. An action should be either a full path to
a script, one of the scripts in the default or user-defined
search path, or an action defined in the config file as 

 action name = "script args..."

The following environment variables will be set by \fBio-watchdog\fR
in each action script:
.RS
.TP
.B "IO_WATCHDOG_TIMEOUT"
The time in seconds of the current \fBio-watchdog\fR timeout.
Note that this is not the amount of time since the last application
write, but instead just the timeout value as set by the user. The
\fBIO_WATCHDOG_TIMEOUT\fR value is expressed as a floating-point
number because this is how the timeout is represented internally
in \fBio-watchdog\fR (e.g. 2 minutes would be represented as
"IO_WATCHDOG_TIMEOUT=120.000").
.TP
.B "IO_WATCHDOG_TARGET"
The name of the process being monitored by the \fBio-watchdog\fR,
and which presumably caused a timeout.
.TP
.B "IO_WATCHDOG_PID"
The PID of the monitored process.
.RE

.TP
.BI "-T, --target=" PATTERN
Only target processes with names matching the shell globbing
pattern PATTERN (See glob(7)). This option is useful when 
running applications under other processes such as \fBtime\fR(1),
as it avoids enabling the io watchdog on the \fBtime\fR process.
.TP
.BI "-r, --rank=" N
Only target rank N (default = 0) of a SLURM job.

.TP
.BI "-m, --method=" TYPE
Specify an alternate method for the IO watchdog timeout.  Valid values for
\fITYPE\fR are \fIsloppy\fR or \fIexact\fR.  By default the IO watchdog
uses the \fIsloppy\fR method, which is a simple and very lightweight
method for tracking IO, and thus has very little impact on application
performance. With the \fIsloppy\fR method, however, the actual time at
which the watchdog will time out may be anywhere from N to 2*N, where N is
the timeout period selected by the user.  When the \fIexact\fR timeout
method is selected, the IO watchdog timestamps application IO which
allows more precise calculation of the next timeout interval. However,
the \fIexact\fR method will have a greater impact on application write
performace, because an extra call to \fIgettimeofday\fR(2) is generated
for every write call.  The performance impact of the \fIexact\fR method
is dependent on the rate of calls to write, not necessarily the amount
of data written.

.TP
.B "-p, --persistent"
The \fBio-watchdog\fR server process will not exit after a watchdog
timeout occurs. This will enable multiple timeouts and possibly
multiple invocations of \fBio-watchdog\fR action scripts. The default
behavior is to have the \fBio-watchdog\fR server process exit after
the first timeout event. \fINOTE:\fR This does not necessarily mean
that the monitored process itself has exited or been killed by the
\fBio-watchdog\fR at this time.


.SH SEE ALSO
\fBio-watchdog.conf\fR(5) 
